Categorical Attribute traNsformation Environment (CANE): A Python module for categorical to numeric data preprocessing

Abstract

Categorical Attribute traNsformation Environment (CANE) is a simple but powerful Python package for preprocessing categorical data. The package is valuable since there is currently a large range of Machine Learning (ML) algorithms that can only be trained using numerical data (e.g., Deep Learning, Support Vector Machines), and several real-world ML applications are associated with categorical data attributes. Currently, CANE offers three categorical to numeric transformation methods, namely: Percentage Categorical Pruned (PCP), Inverse Document Frequency (IDF) and a simpler One-Hot-Encoding method. Additionally, the module is well documented, with several code examples to help in its adoption by non-expert users.
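The three transformations named in the abstract can be illustrated with a minimal pure-Python sketch. This is not CANE's API: the function names `pcp`, `idf` and `one_hot`, the `perc` and `other` parameters, and the sample data are all hypothetical. The abstract does not define the methods, so the sketch assumes that PCP keeps the most frequent levels until they cover a chosen share of the rows and merges the rest into one level, and that IDF maps each level to the logarithm of the row count divided by the level's frequency.

```python
import math
from collections import Counter


def pcp(values, perc=0.8, other="Others"):
    """Assumed PCP behaviour: keep the most frequent levels until they
    cover at least `perc` of the rows, merge the rest into `other`."""
    counts = Counter(values)
    keep, covered = set(), 0
    for level, count in counts.most_common():
        if covered >= perc * len(values):
            break
        keep.add(level)
        covered += count
    return [v if v in keep else other for v in values]


def idf(values):
    """Assumed IDF behaviour: map each level to ln(N / frequency of level),
    so rare levels receive larger numeric values."""
    counts = Counter(values)
    n = len(values)
    return [math.log(n / counts[v]) for v in values]


def one_hot(values):
    """One binary indicator column per distinct level (levels sorted)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical sample attribute: 6 "red", 3 "blue", 1 "green".
colors = ["red"] * 6 + ["blue"] * 3 + ["green"]
pruned = pcp(colors, perc=0.8)   # "green" is merged into "Others"
weights = idf(colors)            # one float per row
levels, indicators = one_hot(colors)
```

Under these assumptions, pruning before one-hot encoding limits the number of generated columns for high-cardinality attributes, while the IDF mapping keeps a single numeric column per attribute.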

Similar resources

Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach

Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either numeric or categorical attributes. However, datasets with mixed types of attributes are common in real life data mining applications. In this paper, we propose a novel divide-and-conquer techniq...

A Divisive Ordering Algorithm for Mapping Categorical Data to Numeric Data

The amount of computing time for K Nearest Neighbor Search is linear in the size of the dataset if the dataset is not indexed. This is not acceptable for on-line applications with time constraints when the dataset is large. However, if there are categorical attributes in the dataset, an index cannot be built on the dataset. One possible solution to index such datasets is to convert categorical a...

Clustering Large Data Sets with Mixed Numeric and Categorical Values

Efficient partitioning of large data sets into homogenous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we pres...

Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering

The K-modes clustering algorithm is well known for its efficiency in clustering large categorical datasets. The K-modes algorithm requires random selection of initial cluster centers (modes) as seed, which leads to the problem that the clustering results are often dependent on the choice of initial cluster centers and non-repeatable cluster structures may be obtained. In this paper, we propose ...

Systematic Search for Categorical Attribute-value Data-driven Machine Learning

Optimal Pruning for Unordered Search is a search algorithm that enables complete search through the space of possible disjuncts at the inner level of a covering algorithm. This algorithm takes as inputs an evaluation function, e, a training set, t, and a set of specialisation operators, o. It outputs a set of operators from o that creates a classifier that maximises e with respect to t. While O...

Journal

Journal title: Software Impacts

Year: 2022

ISSN: 2665-9638

DOI: https://doi.org/10.1016/j.simpa.2022.100359